{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Semantic Double Merging and Chunking"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Install spaCy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install spacy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Download the required spaCy model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python3 -m spacy download en_core_web_md"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Download sample data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'pg_essay.txt'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.node_parser import (\n",
    "    SemanticDoubleMergingSplitterNodeParser,\n",
    "    LanguageConfig,\n",
    ")\n",
    "from llama_index.core import SimpleDirectoryReader"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load the document and create a sample splitter:"
   ]
  },
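  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the essay as a list of Document objects.\n",
    "documents = SimpleDirectoryReader(input_files=[\"pg_essay.txt\"]).load_data()\n",
    "\n",
    "# Configure the language and the spaCy model downloaded earlier.\n",
    "config = LanguageConfig(language=\"english\", spacy_model=\"en_core_web_md\")\n",
    "splitter = SemanticDoubleMergingSplitterNodeParser(\n",
    "    language_config=config,\n",
    "    initial_threshold=0.4,  # threshold for starting a new chunk\n",
    "    appending_threshold=0.5,  # threshold for appending sentences to a chunk\n",
    "    merging_threshold=0.5,  # threshold for merging whole chunks\n",
    "    max_chunk_size=5000,  # maximum chunk size, in characters\n",
    ")"
   ]
  },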
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the nodes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nodes = splitter.get_nodes_from_documents(documents)"
   ]
  },
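  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check, it can help to look at how many chunks were produced and how long they are. This is a minimal sketch that assumes `nodes` from the previous cell is non-empty; lengths are measured in characters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Length of each chunk's text, in characters.\n",
    "chunk_lengths = [len(node.get_content()) for node in nodes]\n",
    "\n",
    "print(f\"Number of chunks: {len(chunk_lengths)}\")\n",
    "print(f\"Shortest chunk: {min(chunk_lengths)} characters\")\n",
    "print(f\"Longest chunk: {max(chunk_lengths)} characters\")\n",
    "print(f\"Average chunk: {sum(chunk_lengths) / len(chunk_lengths):.0f} characters\")"
   ]
  },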
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sample nodes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "What I Worked On\n",
      "\n",
      "February 2021\n",
      "\n",
      "Before college the two main things I worked on, outside of school, were writing and programming.  I didn't write essays.  I wrote what beginning writers were supposed to write then, and probably still are: short stories.  My stories were awful.  They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n",
      "\n",
      " The first programs I tried writing were on the IBM 1401 that our school district used for what was then called \"data processing.\"  This was in 9th grade, so I was 13 or 14.  The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it.  It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.\n",
      "\n",
      " The language we used was an early version of Fortran.  You had to type programs on punch cards, then stack them in the card reader and press a button to load the program into memory and run it.  The result would ordinarily be to print something on the spectacularly loud printer.\n",
      "\n",
      " I was puzzled by the 1401.  I couldn't figure out what to do with it.  And in retrospect there's not much I could have done with it.  The only form of input to programs was data stored on punched cards, and I didn't have any data stored on punched cards.  The only other option was to do things that didn't rely on any input, like calculate approximations of pi, but I didn't know enough math to do anything interesting of that type.  So I'm not surprised I can't remember any programs I wrote, because they can't have done much.  My clearest memory is of the moment I learned it was possible for programs not to terminate, when one of mine didn't.  On a machine without time-sharing, this was a social as well as a technical error, as the data center manager's expression made clear.\n",
      "\n",
      " With microcomputers, everything changed.  Now you could have a computer sitting right in front of you, on a desk, that could respond to your keystrokes as it was running instead of just churning through a stack of punch cards and then stopping.  [1]\n",
      "\n",
      "The first of my friends to get a microcomputer built it himself.  It was sold as a kit by Heathkit.  I remember vividly how impressed and envious I felt watching him sitting in front of it, typing programs right into the computer.\n",
      "\n",
      " Computers were expensive in those days and it took me years of nagging before I convinced my father to buy one, a TRS-80, in about 1980.  The gold standard then was the Apple II, but a TRS-80 was good enough.  This was when I really started programming.  I wrote simple games, a program to predict how high my model rockets would fly, and a word processor that my father used to write at least one book.  There was only room in memory for about 2 pages of text, so he'd write 2 pages at a time and then print them out, but it was a lot better than a typewriter.\n",
      "\n",
      " Though I liked programming, I didn't plan to study it in college.  In college I was going to study philosophy, which sounded much more powerful.  It seemed, to my naive high school self, to be the study of the ultimate truths, compared to which the things studied in other fields would be mere domain knowledge.  What I discovered when I got to college was that the other fields took up so much of the space of ideas that there wasn't much left for these supposed ultimate truths.  All that seemed left for philosophy were edge cases that people in other fields felt could safely be ignored.\n",
      "\n",
      " I couldn't have put this into words when I was 18.  All I knew at the time was that I kept taking philosophy courses and they kept being boring.  So I decided to switch to AI.\n",
      "\n",
      " AI was in the air in the mid 1980s, but there were two things especially that made me want to work on it: a novel by Heinlein called The Moon is a Harsh Mistress, which featured an intelligent computer called Mike, and a PBS documentary that showed Terry Winograd using SHRDLU.  I haven't tried rereading The Moon is a Harsh Mistress, so I don't know how well it has aged, but when I read it I was drawn entirely into its world.  It seemed only a matter of time before we'd have Mike, and when I saw Winograd using SHRDLU, it seemed like that time would be a few years at most.  All you had to do was teach SHRDLU more words.\n",
      "\n",
      " There weren't any classes in AI at Cornell then, not even graduate classes, so I started trying to teach myself.  Which meant learning Lisp, since in those days Lisp was regarded as the language of AI.  The commonly used programming languages then were pretty primitive, and programmers' ideas correspondingly so.  The default language at Cornell was a Pascal-like language called PL/I, and the situation was similar elsewhere.  Learning Lisp expanded my concept of a program so fast that it was years before I started to have a sense of where the new limits were.  This was more like it; this was what I had expected college to do. \n"
     ]
    }
   ],
   "source": [
    "print(nodes[0].get_content())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "I hung around Providence for a bit, and then my college friend Nancy Parmet did me a big favor.  A rent-controlled apartment in a building her mother owned in New York was becoming vacant.  Did I want it?  It wasn't much more than my current place, and New York was supposed to be where the artists were.  So yes, I wanted it!  [7]\n",
      "\n",
      "Asterix comics begin by zooming in on a tiny corner of Roman Gaul that turns out not to be controlled by the Romans.  You can do something similar on a map of New York City: if you zoom in on the Upper East Side, there's a tiny corner that's not rich, or at least wasn't in 1993.  It's called Yorkville, and that was my new home.  Now I was a New York artist — in the strictly technical sense of making paintings and living in New York.\n",
      "\n",
      " I was nervous about money, because I could sense that Interleaf was on the way down.  Freelance Lisp hacking work was very rare, and I didn't want to have to program in another language, which in those days would have meant C++ if I was lucky.  So with my unerring nose for financial opportunity, I decided to write another book on Lisp.  This would be a popular book, the sort of book that could be used as a textbook.  I imagined myself living frugally off the royalties and spending all my time painting.  (The painting on the cover of this book, ANSI Common Lisp, is one that I painted around this time.)\n",
      "\n",
      " The best thing about New York for me was the presence of Idelle and Julian Weber.  Idelle Weber was a painter, one of the early photorealists, and I'd taken her painting class at Harvard.  I've never known a teacher more beloved by her students.  Large numbers of former students kept in touch with her, including me. \n"
     ]
    }
   ],
   "source": [
    "print(nodes[5].get_content())"
   ]
  },
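  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to choose threshold values is to try a few settings and compare how many chunks each produces. The sketch below reuses `config` and `documents` from the cells above; the three settings are hypothetical values picked only for illustration, not recommendations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical threshold settings, chosen only for illustration.\n",
    "threshold_settings = [\n",
    "    {\"initial_threshold\": 0.2, \"appending_threshold\": 0.3, \"merging_threshold\": 0.3},\n",
    "    {\"initial_threshold\": 0.4, \"appending_threshold\": 0.5, \"merging_threshold\": 0.5},\n",
    "    {\"initial_threshold\": 0.6, \"appending_threshold\": 0.7, \"merging_threshold\": 0.7},\n",
    "]\n",
    "\n",
    "for settings in threshold_settings:\n",
    "    # Build a splitter that differs from the one above only in its thresholds.\n",
    "    trial_splitter = SemanticDoubleMergingSplitterNodeParser(\n",
    "        language_config=config,\n",
    "        max_chunk_size=5000,\n",
    "        **settings,\n",
    "    )\n",
    "    trial_nodes = trial_splitter.get_nodes_from_documents(documents)\n",
    "    print(settings, \"->\", len(trial_nodes), \"chunks\")"
   ]
  },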
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Keep in mind that results depend on the spaCy model and on the parameter values, and the best settings vary from text to text. For a text whose subject matter changes clearly, use lower threshold values so that these changes are easy to detect. Conversely, for a text with very uniform subject matter, use higher threshold values so that the text is still split into a greater number of chunks. For more information and a comparison with other chunking methods, see https://bitpeak.pl/chunking-methods-in-rag-methods-comparison/"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
